Data Overview

The data set contains approximately 2276 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.Below is a short description of the variables

  • INDEX - Identification Variable
  • TARGET_WINS - Number of wins
  • TEAM_BATTING_H - Base Hits by batters (1B,2B,3B,HR)
  • TEAM_BATTING_2B - Doubles by batters (2B)
  • TEAM_BATTING_3B - Triples by batters (3B)
  • TEAM_BATTING_HR - Homeruns by batters (4B)
  • TEAM_BATTING_BB - Walks by batters
  • TEAM_BATTING_HBP - Batters hit by pitch (get a free base)
  • TEAM_BATTING_SO - Strikeouts by batters
  • TEAM_BASERUN_SB - Stolen bases
  • TEAM_BASERUN_CS - Caught stealing
  • TEAM_FIELDING_E - Errors
  • TEAM_FIELDING_DP - Double Plays
  • TEAM_PITCHING_BB - Walks allowed
  • TEAM_PITCHING_H - Hits allowed
  • TEAM_PITCHING_HR - Homeruns allowed
  • TEAM_PITCHING_SO - Strikeouts by pitchers

Objective

To build a multiple linear regression model on the training data to predict TARGET_WINS, which is the number of wins for the team.

Data Exploration

##   INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1     1          39           1445             194              39
## 2     2          70           1339             219              22
## 3     3          86           1377             232              35
## 4     4          70           1387             209              38
## 5     5          82           1297             186              27
## 6     6          75           1279             200              36
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
## 6              92             443             973             107
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
## 6              59               NA            1279               92
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168
## 6              443              973             123              149
## [1] 2276   17
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286
## [1] "Number of observations:"
## [1] 2276
## [1] "Observations per year, 1871 - 2006:"
## [1] 16.86

Some columns have maximum values that are clearly outliers, like TEAM_PITCHING_H AND TEAM_PITCHING_HR. The assignment mentions that some of the season records were adjusted to match the performance during a 162-game season. There are 2276 seasons in the training set. Observations span 128 years, with an average of 17 teams playing per year. Based on a quick Google search, there were initially 8 teams in the MLB, and 30 teams in 2006. The MLB has two leagues of the same size since 1901, with the league sizes increasing in the late 20th century.

## Warning: Removed 1005 rows containing missing values (geom_point).

## Warning: Removed 2473 rows containing missing values (geom_point).

## Warning: Removed 3090 rows containing non-finite values (stat_boxplot).

## Warning: Removed 388 rows containing non-finite values (stat_boxplot).

Feature Boxplots and Histograms

Based on this quick look of boxplots and histograms, there are several variables with skewed distributions that would benefit from transformation. Additonally, there are a few variables with bimodal distributions.

Looking at the correlation plots, there appear to be several strong correlations between explanatory variables.

Missing values

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##             0.00             0.00             0.00             0.00 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##             0.00             0.00             0.00             4.48 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##             5.76            33.92            91.61             0.00 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##             0.00             0.00             4.48             0.00 
## TEAM_FIELDING_DP 
##            12.57

In terms of missing values, there are two variables missing many obserations. TEAM_BATTING_HBP is missing over 90% of its values, while TEAM_BASERUN_CS is missing just around 30%. Since so many observations are missing, imputing values could change the distributions considerably. To retain as many features as possible, I think it makes sense to explore these two variables first. The other affected missing values only have 5-10% misisng values. None of these appear to be stand-ins for ‘zero’ values, so mean values can be used insead.

I’ll start by looking at TEAM_BATTING_HBP.

I separated training data into two smaller dataframes, one with complete values for HBP and one omitting this variable.

When HBP has values, it appears that there are no major changes in correlations.

There are three new correlaton plots being considered: the first is all datapoints, then a plot with missing hbp values, and finally a plot for rows with hbp values same as the previous pair. There appear to be no major discrepancies between missing values and the overall set. However, comparing missing values to available values does illustrate there are some distinct changes correlation when the hbp was recorded. This may be because HBP only represents only a small proportion of entries and has more variation. However, there also appear to be stronger correlations when HBP is added, which may help predict wins better than omitting altogether.

## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX - TEAM_BATTING_HBP - TEAM_BATTING_SO - 
##     TEAM_BATTING_HR - TEAM_BASERUN_CS - TEAM_BATTING_H - TEAM_BASERUN_SB - 
##     TEAM_PITCHING_BB - TEAM_BATTING_2B - TEAM_BATTING_3B, data = baseball_hbp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8123  -5.9942  -0.0737   5.3098  22.2025 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.916227  19.219854   3.274 0.001269 ** 
## TEAM_BATTING_BB   0.055959   0.009466   5.912 1.61e-08 ***
## TEAM_PITCHING_H   0.026147   0.010184   2.567 0.011041 *  
## TEAM_PITCHING_HR  0.091571   0.024033   3.810 0.000189 ***
## TEAM_PITCHING_SO -0.028772   0.007191  -4.001 9.13e-05 ***
## TEAM_FIELDING_E  -0.173897   0.039905  -4.358 2.18e-05 ***
## TEAM_FIELDING_DP -0.121570   0.035338  -3.440 0.000719 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.473 on 184 degrees of freedom
## Multiple R-squared:  0.5263, Adjusted R-squared:  0.5109 
## F-statistic: 34.07 on 6 and 184 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX - TEAM_PITCHING_BB - TEAM_PITCHING_HR - 
##     TEAM_BATTING_H, data = baseball_hbp_dummy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.3981  -6.6295   0.1545   6.4842  28.2220 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      58.585629   6.442796   9.093  < 2e-16 ***
## TEAM_BATTING_2B  -0.060017   0.009747  -6.158 9.50e-10 ***
## TEAM_BATTING_3B   0.166293   0.022021   7.552 7.51e-14 ***
## TEAM_BATTING_HR   0.100869   0.009172  10.998  < 2e-16 ***
## TEAM_BATTING_BB   0.038251   0.003366  11.363  < 2e-16 ***
## TEAM_BATTING_SO   0.040704   0.009102   4.472 8.35e-06 ***
## TEAM_BASERUN_SB   0.034100   0.008689   3.924 9.10e-05 ***
## TEAM_BASERUN_CS   0.052980   0.018176   2.915  0.00361 ** 
## TEAM_PITCHING_H   0.031740   0.004269   7.435 1.76e-13 ***
## TEAM_PITCHING_SO -0.058995   0.007547  -7.817 1.02e-14 ***
## TEAM_FIELDING_E  -0.158154   0.009939 -15.912  < 2e-16 ***
## TEAM_FIELDING_DP -0.112916   0.013095  -8.623  < 2e-16 ***
## TEAM_HBP_YES_NO  -2.456525   0.923761  -2.659  0.00792 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.532 on 1473 degrees of freedom
##   (790 observations deleted due to missingness)
## Multiple R-squared:  0.4407, Adjusted R-squared:  0.4361 
## F-statistic: 96.71 on 12 and 1473 DF,  p-value: < 2.2e-16

I compared two preliminary linear models that I arrived at through backward selection. Looking only at HBP-containing observations, there’s a small increase in r-squared compared to a model that uses a dummy variable to consider whether values were available.

Next, I’ll look at TEAM_BASERUN_CS, which was missing about 30% of its values.

## [1] 1

Same as HBP, it appears CS did not miscode values of 0 as NA. I’m going to separate the dataset in the same way as HBP to see if there are differences in its distribution and correlation plots.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2706 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1331 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1375 rows containing non-finite values (stat_bin).

After subsetting for availability of CS statistics, an interesting pattern emerges: our three bimodal variables, TEAM_PITCHING_HR, TEAM_BATTING_SO, and TEAM_BATTING_HR, are no longer bimodal.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).

As these three histograms illustrate, the bimodal distributions are explained by missing CS values or not. Missing values explain both modes present in the overall histogram.

Fitting a Linear Model

My first change to the data was to eliminate the index and, replace HBP and BASERUN_CS with dummy variables.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = baseball_df_fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.424  -6.972   0.192   6.983  28.645 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      57.987326   5.994875   9.673  < 2e-16 ***
## TEAM_BATTING_H   -0.027606   0.016383  -1.685 0.092156 .  
## TEAM_BATTING_2B  -0.043602   0.009296  -4.691 2.93e-06 ***
## TEAM_BATTING_3B   0.186256   0.018867   9.872  < 2e-16 ***
## TEAM_BATTING_HR   0.155277   0.081692   1.901 0.057493 .  
## TEAM_BATTING_BB   0.102630   0.042342   2.424 0.015456 *  
## TEAM_BATTING_SO   0.030619   0.021908   1.398 0.162398    
## TEAM_BASERUN_SB   0.068643   0.005505  12.469  < 2e-16 ***
## TEAM_PITCHING_H   0.053979   0.014889   3.625 0.000296 ***
## TEAM_PITCHING_HR -0.052190   0.078164  -0.668 0.504413    
## TEAM_PITCHING_BB -0.064794   0.040239  -1.610 0.107522    
## TEAM_PITCHING_SO -0.047628   0.020830  -2.286 0.022341 *  
## TEAM_FIELDING_E  -0.127819   0.007304 -17.499  < 2e-16 ***
## TEAM_FIELDING_DP -0.104483   0.012332  -8.472  < 2e-16 ***
## TEAM_CS_YES_NO   -3.839845   0.799028  -4.806 1.67e-06 ***
## TEAM_HBP_YES_NO  -2.647365   0.964312  -2.745 0.006104 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 1819 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4152, Adjusted R-squared:  0.4103 
## F-statistic: 86.09 on 15 and 1819 DF,  p-value: < 2.2e-16

The initial linear model explains 41% of variation. Next, I’ll add some log transformations of skewed columns: TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E.

## 
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) + 
##     log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB), data = baseball_df_fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.639  -6.850   0.083   6.851  29.725 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           209.367276  78.152118   2.679 0.007452 ** 
## TEAM_BATTING_H         -0.029153   0.016874  -1.728 0.084208 .  
## TEAM_BATTING_2B        -0.038719   0.009338  -4.146 3.54e-05 ***
## TEAM_BATTING_3B         0.195250   0.019028  10.261  < 2e-16 ***
## TEAM_BATTING_HR         0.142934   0.082824   1.726 0.084560 .  
## TEAM_BATTING_BB         0.140220   0.044797   3.130 0.001775 ** 
## TEAM_BATTING_SO         0.013917   0.022007   0.632 0.527220    
## TEAM_BASERUN_SB         0.084057   0.015811   5.316 1.19e-07 ***
## TEAM_PITCHING_H         0.056607   0.015418   3.671 0.000248 ***
## TEAM_PITCHING_HR       -0.048334   0.079109  -0.611 0.541295    
## TEAM_PITCHING_BB       -0.035418   0.041146  -0.861 0.389466    
## TEAM_PITCHING_SO       -0.053456   0.021981  -2.432 0.015117 *  
## TEAM_FIELDING_E        -0.069984   0.021909  -3.194 0.001426 ** 
## TEAM_FIELDING_DP       -0.104523   0.012260  -8.525  < 2e-16 ***
## TEAM_CS_YES_NO         -3.568860   0.803825  -4.440 9.54e-06 ***
## TEAM_HBP_YES_NO        -3.061485   1.007413  -3.039 0.002408 ** 
## log(TEAM_FIELDING_E)  -11.718073   4.037533  -2.902 0.003749 ** 
## log(TEAM_PITCHING_BB) -36.670662  12.641719  -2.901 0.003767 ** 
## log(TEAM_PITCHING_SO)  17.416259   6.179754   2.818 0.004881 ** 
## log(TEAM_BASERUN_SB)   -2.068123   1.483764  -1.394 0.163538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 1815 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4248, Adjusted R-squared:  0.4188 
## F-statistic: 70.55 on 19 and 1815 DF,  p-value: < 2.2e-16

This model explains more variation, but the F-statistic decreased relative to the original model. Next, I’m going to add a few features I’m curious about. TEAM_BATTING_H considers all base hits, including 2B, 3B,and HR. I will create a new variable only looking at singles called TEAM_BATTING_1B. Related to this, I will also incorporate an approximation of an important baseball statistic, slugging. Because some base hits convert to runs at different rates, slugging weighs, singles, doubles, triples and home runs with increasing weight. Usually, slugging also has a denominator of at-bats, which is unavailable in this dataset. Instead, I’ll approximate this by dividing by the number of hits. The weights I’m assigning are proportional to the number of bases, so 1 for single, 2 for double… 4 for HR.

## 
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) + 
##     log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB), data = baseball_df_fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.610  -6.830   0.047   6.788  29.845 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           178.96333   99.72432   1.795 0.072887 .  
## TEAM_BATTING_H         -0.02227    0.02194  -1.015 0.310078    
## TEAM_BATTING_2B        -0.05084    0.02640  -1.926 0.054310 .  
## TEAM_BATTING_3B         0.17007    0.05471   3.109 0.001909 ** 
## TEAM_BATTING_HR         0.10812    0.10905   0.991 0.321574    
## TEAM_BATTING_BB         0.13896    0.04488   3.096 0.001989 ** 
## TEAM_BATTING_SO         0.01206    0.02233   0.540 0.589305    
## TEAM_BASERUN_SB         0.08407    0.01581   5.316 1.19e-07 ***
## TEAM_PITCHING_H         0.05546    0.01560   3.556 0.000386 ***
## TEAM_PITCHING_HR       -0.05053    0.07925  -0.638 0.523812    
## TEAM_PITCHING_BB       -0.03328    0.04138  -0.804 0.421329    
## TEAM_PITCHING_SO       -0.05302    0.02200  -2.410 0.016069 *  
## TEAM_FIELDING_E        -0.07007    0.02191  -3.197 0.001411 ** 
## TEAM_FIELDING_DP       -0.10475    0.01227  -8.536  < 2e-16 ***
## TEAM_CS_YES_NO         -3.62119    0.81103  -4.465 8.50e-06 ***
## TEAM_HBP_YES_NO        -3.00343    1.01454  -2.960 0.003112 ** 
## TEAM_BATTING_1B              NA         NA      NA       NA    
## TEAM_BATTING_SLG       18.12471   36.91502   0.491 0.623497    
## log(TEAM_FIELDING_E)  -11.66550    4.03980  -2.888 0.003927 ** 
## log(TEAM_PITCHING_BB) -37.13340   12.67944  -2.929 0.003447 ** 
## log(TEAM_PITCHING_SO)  18.47861    6.54882   2.822 0.004829 ** 
## log(TEAM_BASERUN_SB)   -2.03114    1.48598  -1.367 0.171838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 1814 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4249, Adjusted R-squared:  0.4185 
## F-statistic: 67.01 on 20 and 1814 DF,  p-value: < 2.2e-16

Neither of these features offered additional significance. Finally, I’ll use back-selection to eliminate non-contributing variables.

## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_1B + log(TEAM_FIELDING_E) + 
##     log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB) - 
##     TEAM_BATTING_SLG - TEAM_PITCHING_H - TEAM_BATTING_BB - TEAM_BATTING_SO - 
##     TEAM_PITCHING_HR - TEAM_PITCHING_BB - TEAM_FIELDING_E, data = baseball_df_fix)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.5371  -6.9106   0.1119   7.0369  28.4639 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -26.850089  38.695974  -0.694 0.487850    
## TEAM_BATTING_H          0.029999   0.004279   7.010 3.34e-12 ***
## TEAM_BATTING_2B        -0.040948   0.009282  -4.412 1.09e-05 ***
## TEAM_BATTING_3B         0.199025   0.019098  10.421  < 2e-16 ***
## TEAM_BATTING_HR         0.094197   0.008831  10.666  < 2e-16 ***
## TEAM_BASERUN_SB         0.073484   0.015580   4.717 2.58e-06 ***
## TEAM_PITCHING_SO       -0.039449   0.006805  -5.797 7.94e-09 ***
## TEAM_FIELDING_DP       -0.103899   0.012282  -8.459  < 2e-16 ***
## TEAM_CS_YES_NO         -3.536451   0.786013  -4.499 7.25e-06 ***
## TEAM_HBP_YES_NO        -3.574125   0.935744  -3.820 0.000138 ***
## log(TEAM_FIELDING_E)  -23.337095   1.331275 -17.530  < 2e-16 ***
## log(TEAM_PITCHING_BB)  17.535210   1.507901  11.629  < 2e-16 ***
## log(TEAM_PITCHING_SO)  16.615041   5.730249   2.900 0.003782 ** 
## log(TEAM_BASERUN_SB)   -1.478015   1.483772  -0.996 0.319324    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 1821 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4148, Adjusted R-squared:  0.4106 
## F-statistic: 99.28 on 13 and 1821 DF,  p-value: < 2.2e-16

Final Model using all Training Ddata

For my final model I considered, I originally modeled all of the dummy variables but they ended up not contributing anything to the model. This final model eliminates several features altogether, transforms three, and considers four different interaction effects.

## 
## Call:
## lm(formula = TARGET_WINS ~ (TEAM_BATTING_H * TEAM_BATTING_2B + 
##     TEAM_BATTING_H * TEAM_BATTING_3B + TEAM_BATTING_H * TEAM_BATTING_HR), 
##     data = baseball_df_fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.807  -8.816   0.569   9.585  58.270 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -7.961e+00  1.291e+01  -0.616 0.537654    
## TEAM_BATTING_H                  4.868e-02  9.039e-03   5.386 7.97e-08 ***
## TEAM_BATTING_2B                -2.802e-02  5.708e-02  -0.491 0.623581    
## TEAM_BATTING_3B                 4.678e-01  1.016e-01   4.605 4.34e-06 ***
## TEAM_BATTING_HR                 2.462e-01  6.928e-02   3.553 0.000388 ***
## TEAM_BATTING_H:TEAM_BATTING_2B  2.323e-05  3.814e-05   0.609 0.542590    
## TEAM_BATTING_H:TEAM_BATTING_3B -2.231e-04  6.497e-05  -3.434 0.000606 ***
## TEAM_BATTING_H:TEAM_BATTING_HR -1.108e-04  4.622e-05  -2.397 0.016596 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 2268 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.2049 
## F-statistic: 84.77 on 7 and 2268 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_SO) + 
##     log(TEAM_BASERUN_SB) + TEAM_BATTING_3B:TEAM_BATTING_HR + 
##     TEAM_BATTING_2B:TEAM_BATTING_HR + TEAM_BATTING_H:TEAM_BATTING_HR + 
##     TEAM_BATTING_H:TEAM_BATTING_3B - TEAM_BATTING_3B - TEAM_BATTING_SO - 
##     TEAM_BATTING_2B - TEAM_BATTING_BB - TEAM_BATTING_HR - TEAM_BATTING_H - 
##     TEAM_BATTING_HR - TEAM_PITCHING_HR, data = baseball_df_fix)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.5312  -6.8053   0.2583   6.9347  28.2779 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -2.351e+01  6.053e+01  -0.388 0.697793    
## TEAM_BASERUN_SB                  8.305e-02  1.569e-02   5.294 1.34e-07 ***
## TEAM_PITCHING_H                  9.712e-04  2.786e-03   0.349 0.727471    
## TEAM_PITCHING_BB                 3.167e-02  2.982e-03  10.620  < 2e-16 ***
## TEAM_PITCHING_SO                -4.877e-02  8.190e-03  -5.954 3.13e-09 ***
## TEAM_FIELDING_E                 -5.022e-02  2.155e-02  -2.330 0.019904 *  
## TEAM_FIELDING_DP                -1.044e-01  1.235e-02  -8.455  < 2e-16 ***
## TEAM_CS_YES_NO                  -3.890e+00  8.114e-01  -4.794 1.77e-06 ***
## TEAM_HBP_YES_NO                 -2.391e+00  1.037e+00  -2.306 0.021229 *  
## TEAM_BATTING_1B                  1.355e-02  1.025e-02   1.322 0.186218    
## TEAM_BATTING_SLG                 1.945e+01  1.843e+01   1.056 0.291165    
## log(TEAM_FIELDING_E)            -1.473e+01  4.006e+00  -3.678 0.000242 ***
## log(TEAM_PITCHING_SO)            2.471e+01  6.587e+00   3.751 0.000181 ***
## log(TEAM_BASERUN_SB)            -2.089e+00  1.473e+00  -1.419 0.156170    
## TEAM_BATTING_3B:TEAM_BATTING_HR -8.015e-04  2.746e-04  -2.918 0.003561 ** 
## TEAM_BATTING_2B:TEAM_BATTING_HR -3.582e-04  7.749e-05  -4.623 4.05e-06 ***
## TEAM_BATTING_H:TEAM_BATTING_HR   1.300e-04  2.427e-05   5.355 9.65e-08 ***
## TEAM_BATTING_H:TEAM_BATTING_3B   1.730e-04  2.327e-05   7.435 1.60e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.04 on 1817 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4231, Adjusted R-squared:  0.4177 
## F-statistic:  78.4 on 17 and 1817 DF,  p-value: < 2.2e-16

The R-squared statistic indicates that this model predicts less than half of the variation in wins with the included features. For a next step, I hope to use cross-validation techniques to split the training data further and allow me to compare RMSE of various models.

Evaluation Data

I also loaded the evaluation data and predicted the wins using my final model. Since the actual wins are withheld, I compared the distribution of predictions to the actual wins in the training set. The means were similar but the training data included much more variation between teams. It’s also worth mentioning as well that using the predict function creates missing values as the evaluation data is missing. In fact, for TEAM_BATTING_HBP, over 90% of rows are missing entries.

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0               18               13 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##               87              240                0                0 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0               18                0               31

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   54.33   75.52   80.85   80.47   85.46  102.01
## [1] 7.711588
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00
## [1] 15.75215